
Frontiers in Digital Health

Frontiers Media SA

All preprints, ranked by how well they match Frontiers in Digital Health's content profile, based on 20 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
Evaluating Voice-Enabled Generative AI for Mental Health: Real-Time Performance and Safety Analyses

Ngo, N.; Sano, A.

2025-11-17 health informatics 10.1101/2025.11.14.25340246 medRxiv
Top 0.1%
28.0%

This study investigates the integration of Voice AI into a locally hosted generative AI chatbot designed to function as a mental health assistant, with the goal of enabling intuitive, voice-based therapeutic interaction. Leveraging the Llama 3.1 8B language model for privacy-preserving generation, the system combines Deepgram's Speech-to-Text API and OpenAI's Text-to-Speech API within a WebRTC-based framework to support low-latency, bi-directional communication. A custom pipeline facilitates real-time voice input and output, aiming to reduce barriers to engagement and foster a more natural conversational flow. Technical evaluation focuses on latency across short, long-form, and multi-turn dialogues, revealing response times within tolerable bounds for synchronous use. Prompt engineering and system-prompt customization guide empathetic, context-aware responses in standard therapeutic scenarios, though limitations persist in handling edge cases. These findings suggest that locally hosted voice-enabled LLMs can support responsive, privacy-conscious mental health applications, with future work directed toward fine-tuning for high-risk interactions.
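The per-stage latency evaluation described above (speech-to-text, generation, text-to-speech) can be instrumented generically. A minimal sketch, in which `stt`, `llm`, and `tts` are hypothetical stand-ins for the real service calls (a Deepgram STT request, local Llama 3.1 8B generation, an OpenAI TTS request), not the authors' actual pipeline:

```python
import time

def timed(stage_fn, *args):
    """Run one pipeline stage and return (result, elapsed seconds)."""
    t0 = time.perf_counter()
    out = stage_fn(*args)
    return out, time.perf_counter() - t0

def run_pipeline(stt, llm, tts, audio_in):
    """Measure per-stage latency for one voice exchange.

    stt/llm/tts are placeholders for the real services; any callables
    with matching input/output shapes work.
    """
    latencies = {}
    text, latencies["stt"] = timed(stt, audio_in)
    reply, latencies["llm"] = timed(llm, text)
    audio_out, latencies["tts"] = timed(tts, reply)
    latencies["total"] = sum(latencies.values())
    return audio_out, latencies
```

Logging a dict like this per turn is enough to reproduce the short vs. long-form vs. multi-turn latency breakdown the abstract reports on.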

2
Infusing behavior science into large language models for activity coaching

Hegde, N. G.; Vardhan, M.; Nathani, D.; Rosenzweig, E.; Seneviratne, M.; Karthikesalingam, A.

2023-04-03 health systems and quality improvement 10.1101/2023.03.31.23287995 medRxiv
Top 0.1%
23.4%

Large language models (LLMs) have shown promise for task-oriented dialogue across a range of domains. The use of LLMs in health and fitness coaching is under-explored. Behavior science frameworks such as COM-B, which conceptualizes behavior change in terms of Capability (C), Opportunity (O), and Motivation (M), can be used to architect coaching interventions in a way that promotes sustained change. Here we aim to incorporate behavior science principles into an LLM using two knowledge infusion techniques: coach message priming (where exemplar coach responses are provided as context to the LLM), and dialogue re-ranking (where the COM-B category of the LLM output is matched to the inferred user need). Simulated conversations were conducted between the primed or unprimed LLM and a member of the research team, and then evaluated by 8 human raters. Ratings for the primed conversations were significantly higher in terms of empathy and actionability. The same raters also compared a single response generated by the unprimed, primed, and re-ranked models, finding a significant uplift in actionability from the re-ranking technique. This is a proof of concept of how behavior science frameworks can be infused into automated conversational agents for a more principled coaching experience. Institutional Review Board (IRB): The study does not involve human subjects beyond the volunteer annotators; IRB approval was not sought for this research.
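The dialogue re-ranking step, matching the COM-B category of candidate responses to the inferred user need, reduces to a stable sort over labeled candidates. A minimal sketch; the category labels and the classifier that produces them are assumptions, not details given in the abstract:

```python
def rerank(candidates, inferred_need):
    """candidates: list of (response_text, com_b_category) pairs, where
    com_b_category is one of "capability", "opportunity", "motivation".

    Responses whose category matches the user's inferred COM-B need are
    moved to the front; Python's sort is stable, so the LLM's original
    ranking is preserved within the matching and non-matching groups."""
    return sorted(candidates, key=lambda c: c[1] != inferred_need)
```

In the paper's setup a separate model would infer `inferred_need` from the user turn; here it is simply passed in.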

3
STELLA: Safety Testing Engine for Large Language Assistants

Perlis, R. H.; Bin Adil, A.; Dobyns, K.

2025-12-15 health informatics 10.64898/2025.12.11.25342078 medRxiv
Top 0.1%
18.6%

Background: Assistants incorporating large language models are increasingly applied in the context of health care, where they represent a promising means of expanding access to care. However, there is growing recognition of the risk that these chatbots may fail to respond appropriately to individuals in crisis and may adversely affect mental health in some circumstances. Methods: We developed and implemented an automated system for assessing voice or text AI assistant responses to users across a range of health scenarios. This set of tools incorporates simulated users with a specified set of characteristics; scenarios in which they interact with a chatbot over multiple rounds; and designs that allow multiple cohorts to be compared. Study designs including simulated randomized trials can be generated via natural language prompts. Chatbot session transcripts are then quantified in terms of safety, efficacy, and user engagement according to prespecified rubrics and exemplars with an ensemble of judging language models, allowing specific exchanges to be flagged for manual review. To illustrate this approach, we assessed 10 safety scenarios in 11 frontier language model chatbots, including Claude Opus 4.5, ChatGPT-5.2, and Gemini 3, using 5 personas, each followed over 10 exchanges, with a subset assessed for an additional 5 personas. Results: The total proportion of responses flagged for possible harmful content ranged from 3.2% (95% CI 2.0-5.1%) for GPT 5.2 to 34.0% (95% CI 30.0-38.3%) for Grok-4.1-fast-non-reasoning. The total proportion of responses flagged for failing to provide beneficial content ranged from 19.6% (95% CI 16.4-23.3%) for GPT 5.2 to 66.0% (95% CI 61.7-70.0%) for Grok-4.1-fast-reasoning. In aggregate, the proportion of unsafe content increased across turns; for failure to provide beneficial content, by 0.7% per turn (95% CI 0.3%-1.1%).
Conclusion: A simulation-based test harness can facilitate the rapid characterization and comparison of large language model assistant performance according to standardized rubrics. Existing frontier models vary substantially on these metrics. Simulation strategies such as this one may accelerate efforts to ensure that chatbots yield benefit rather than harm to users who seek to apply them to address mental health and well-being.
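The flagged-response proportions above are reported with 95% confidence intervals. The Wilson score interval is one common choice for proportions near 0 (the abstract does not state which method the authors used); a sketch with illustrative counts, since the study's denominators are not given in the abstract:

```python
import math

def wilson_ci(flagged, total, z=1.96):
    """95% Wilson score interval for a proportion of flagged responses."""
    p = flagged / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total
                         + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half
```

With 16 flagged responses out of 500, for instance, this yields roughly (0.020, 0.051), the same shape of asymmetric interval as those reported above.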

4
LLM-Assisted Taxonomy and Temporal Analysis of Provider Questions About HIV in provider-to-provider telehealth

Zareei Shams Abadi, A. E.; Becevic, M.; Dandachi, D.

2025-12-27 health informatics 10.64898/2025.12.19.25342694 medRxiv
Top 0.1%
17.6%

Introduction: Ongoing education in HIV care is limited for many healthcare providers working in rural and non-academic settings, which can reduce patients' access to high-quality care. To guide targeted tele-mentoring and continuing education, we analyzed questions submitted by clinicians during Extension for Community Healthcare Outcomes (ECHO) sessions to characterize learning needs and thematic trends. Methods: We reviewed 78 clinical questions submitted during Project ECHO sessions and developed a structured classification of topics raised by clinicians. Using text-embedding representations and large language models (LLMs), we explored automated approaches to categorize questions and identify thematic clusters. Analyses compared the distribution of topics across professional roles to detect role-specific learning needs. Results: Distinct topic patterns emerged by clinician type. Physicians and pharmacists most often asked about initiating and modifying antiretroviral therapy (ART). Nurse practitioners focused on ART and adherence support, while allied health professionals and PAs raised social-support and care-navigation issues. Medication-related questions frequently highlighted adherence concerns and ART change considerations. Discussion: ECHO questions reveal clear, role-dependent learning needs that can inform targeted tele-mentoring. LLM-based embeddings provided a practical, scalable way to classify questions and monitor trends, supporting more tailored HIV training for different provider groups.
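The embedding-based categorization can be pictured as nearest-centroid assignment in embedding space. A toy sketch using cosine similarity; the actual embedding model and topic centroids are not specified in the abstract, and the 2-D vectors below are purely illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def assign_topics(question_vecs, topic_centroids):
    """Assign each embedded question to the topic whose centroid is most
    similar; topic_centroids maps topic name -> centroid vector."""
    return [max(topic_centroids, key=lambda t: cosine(v, topic_centroids[t]))
            for v in question_vecs]
```

In practice the question vectors would come from a text-embedding model and the centroids from clustering or from averaging labeled examples.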

5
Model Development and Real-World Deployment of Multimodal Input-Based Subtyping of Depression in Tele-Counseling for Scalable Mental Health Assessment

Francis, A. J. A.; Raza, A.; Patel, N.; Gajbhiye, R.; Kumar, V.; T, A.; Saikia, A.; Mibang, O.; K, V.; Joshi, K.; Tony, L.; Balasubramani, P. P.

2026-02-18 psychiatry and clinical psychology 10.64898/2026.02.11.25342657 medRxiv
Top 0.1%
17.1%

The rapid growth of tele-counseling and the use of lay counselors in high-volume, low-resource mental health services has created a need for scalable tools for early detection and triage. Effective personalization now requires stratifying individuals by dominant symptom profiles, such as appetite, agency, anxiety, and sleep disturbances. Depression symptoms vary widely, even among those with similar scores, reflecting distinct psychophysiological and cognitive-affective patterns. In tele-mental-health settings, where contextual cues are limited, multimodal behavioral signals from natural interactions can complement traditional assessments. Using synchronized audio, video, and text data from the EDAIC dataset (N=275), we propose a multimodal learning framework to classify five clinically validated outcomes: Depression, Appetite disturbance, Agency impairment, Anxiety, and Sleep problems. We developed a comprehensive multimodal machine-learning pipeline, incorporating automated dataset construction, modality-specific feature extraction (acoustic, facial action unit, linguistic), and supervised learning with cross-validation. Labels were derived from validated scoring rules to ensure clinical relevance. Sentiment analysis revealed lower sentiment scores in participants with high Depression, Anxiety, or Agency scores, but no significant differences by Appetite or Sleep severity. Model performance was assessed across three scenarios: text (transcripts), phone calls (audio + transcript), and video calls (audio + video + transcript). Temporal models (CNN+BiLSTM) achieved over 65% accuracy across modalities, while a fine-tuned temporal model for depression detection using video calls reached an accuracy of 81% with an F1-score of 0.79, demonstrating that our approach performs on par with state-of-the-art methods. XGBoost excelled in phone and video calls, while Ridge classifiers performed best for text-based inputs.
SHAP analysis identified key audio and video features for detecting Depression and other symptoms. A translational avatar-based interface validated system operability, demonstrating the potential for scalable, objective mental-health assessment in tele-counseling.

6
Benchmarking And Datasets For Ambient Clinical Documentation: A Review Of Existing Frameworks And Metrics For AI-Assisted Medical Note Generation

Gebauer, S.

2025-01-29 health informatics 10.1101/2025.01.29.25320859 medRxiv
Top 0.1%
16.9%

Background: The increasing adoption of ambient artificial intelligence (AI) scribes in healthcare has created an urgent need for robust evaluation frameworks to assess their performance and clinical utility. While these tools show promise in reducing documentation burden, there remains no standardized approach for measuring their effectiveness and safety. Objective: To systematically review existing evaluation frameworks and metrics used to assess AI-assisted medical note generation from doctor-patient conversations, and to provide recommendations for future evaluation approaches. Methods: A scoping review following PRISMA guidelines was conducted across PubMed, IEEE Xplore, Scopus, Web of Science, and Embase to identify studies evaluating ambient scribe technology between 2020 and 2025. Studies were included if they were peer-reviewed, focused on clinical ambient scribe evaluation from speech to note production, and described an evaluation approach. Extracted data included evaluation metrics, benchmarking approaches, dataset characteristics, and model performance. Results: Seven studies met inclusion criteria. Evaluation approaches varied widely, from traditional natural language processing metrics like ROUGE and BERTScore to domain-specific measures such as clinical accuracy and bias. Critical gaps identified include: 1) wide diversity of evaluation metrics making cross-study comparison challenging, 2) limited integration of clinical relevance in automated metrics, 3) lack of standardized approaches for crucial metrics like hallucinations and errors, and 4) minimal diversity in clinical specialties evaluated. Only two datasets were publicly available for benchmarking. Conclusions: This review reveals significant heterogeneity in how ambient scribes are evaluated, highlighting the need for standardized evaluation frameworks. We propose recommendations for developing comprehensive evaluation approaches that combine automated metrics with clinical quality measures.
Future work should focus on creating public benchmarks across diverse clinical settings and establishing consensus on critical safety and quality metrics.
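Of the NLP metrics mentioned, ROUGE is the simplest to state precisely: ROUGE-1 recall is the fraction of reference unigrams that also appear in the generated note. A minimal sketch of that definition; real evaluations typically use a library implementation with stemming and the longer-n-gram and longest-common-subsequence variants:

```python
from collections import Counter

def rouge1_recall(reference, candidate):
    """Fraction of reference unigrams covered by the candidate, with
    clipped counts (each reference occurrence matched at most once)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum(min(n, cand[w]) for w, n in ref.items())
    return overlap / sum(ref.values())
```

High ROUGE recall alone cannot catch hallucinations, which is exactly the gap in automated metrics the review identifies.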

7
A Digital Biomarker Dataset in Hematopoietic Cell Transplantation: A Longitudinal Study of Caregiver-Patient Dyads (dHCT)

Jalin, A.; Swatthong, N.; Rozwadowski, M.; Kumar, R.; Barton, D.; Braun, T.; Carlozzi, N.; Hanauer, D. A.; Hassett, A.; Choi, S. W.

2024-11-22 health informatics 10.1101/2024.11.21.24317641 medRxiv
Top 0.1%
14.9%

Background: Hematopoietic stem cell transplantation (HCT) is a potentially life-saving therapy for individuals with blood diseases, but involves a challenging recovery process that requires dedicated caregivers. The complex interplay between emotional distress, care partner (or unpaid caregiver) burden, and treatment outcomes necessitates comprehensive physiological and psychological measurement to fully understand these dynamics. Findings: We collected longitudinal data from 166 HCT caregiver-patient dyads over 120 days post-transplant as part of a randomized controlled trial (NCT04094844). Data were gathered using the Fitbit® Charge 3 device, a custom mood-reporting app with positive psychology-based activities (Roadmap), PROMIS® health measures, and clinical events. The dataset includes minute-level heart rate, daily sleep metrics, step counts, self-reported mood scores, app usage metrics, PROMIS® T-scores (i.e., global health, depression), infection and readmission records, and clinical outcomes (e.g., acute and chronic graft-versus-host disease, relapse, mortality). Physiological data were available for both caregivers and patients. Data validation confirmed high compliance with mood reporting and physiological patterns that differed between caregivers and patients (i.e., lower activity in patients than in caregivers across time). Conclusions: This dataset offers an unprecedented view into the daily fluctuations of caregiver and patient well-being throughout the critical post-HCT period. It provides a valuable resource for researchers investigating the impact of mHealth app interventions, including emotional distress and physiological markers, on treatment course and clinical outcomes. Our unique dataset informs interventions that may address caregiver support, patient care, or dyadic-focused strategies, and enables novel analyses of single-member or dyadic dynamics in HCT treatment.

8
The smartphone: an evolution or revolution in virtual patient healthcare during and beyond the COVID-19 pandemic? An evaluation and comparison of the smartphone against other currently available wearable technologies in a secondary care setting during the COVID-19 pandemic.

Raza, A.; Mukherjee, S.; Patel, V.; Kamal, N.; Lichtarowicz-Krynska, E.

2020-11-10 health informatics 10.1101/2020.11.06.20223206 medRxiv
Top 0.1%
14.7%

Smartphones are now commonly used for virtual outpatient consultations to help reduce disease transmission during the COVID-19 pandemic. Nosocomial spread of COVID-19 and hospital-acquired infections are usually from staff or students to patients. Reducing non-essential staff numbers on ward rounds may reduce the risk. We describe the novel use of smartphones, with Microsoft Teams, to live-stream inpatient interactions, radiological images, pathology results, charts, and patient review between an office-based and ward team (virtual ward round) and for teaching medical students in secondary care. After Research and Ethics, Digital Services, and Information Governance approval, we compared a smartphone and a head-worn device (Realwear HMT-1). Data collection was by participant questionnaire. Statistical analysis was performed using the Mann-Whitney test. There was no statistically significant difference in audio and video feed quality between the smartphone (p = 0.3) and Realwear device (p = 0.41). However, the smartphone was preferred during ward rounds and was 85% cheaper than the Realwear device. Urology medical staff numbers on the ward were reduced by 50%. Ward round efficiency improved as administrative tasks could be performed by the office team during the virtual ward round. Virtual ward rounds using smartphones can facilitate remote communication between staff, students, and patients. Staff in isolation or shielding can also assist front-line colleagues from home. Smarter use of the smartphone may help reduce staff numbers on wards and reduce the number of COVID-19 and other nosocomial infections, potentially reducing morbidity and mortality locally and globally.
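The Mann-Whitney comparison used above is built on the U statistic, which counts how often one group's values exceed the other's, with ties counting half. A minimal sketch of the statistic itself; turning U into the study's p-values additionally requires the null distribution, as provided by a full statistics package:

```python
def mann_whitney_u(x, y):
    """U statistic for sample x vs. sample y: the number of pairs
    (xi, yj) with xi > yj, counting ties as 0.5 each."""
    return sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
               for xi in x for yj in y)
```

U near its midpoint (len(x) * len(y) / 2) indicates similar rating distributions, which is consistent with the non-significant audio/video quality differences reported.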

9
Recovering Clinical Detail in AI-Generated Responses for Low Back Pain Through Prompt Design

Basharat, A.; Hamza, O.; Rana, P.; Odonkor, C. A.; Chow, R.

2026-04-23 pain medicine 10.64898/2026.04.21.26351437 medRxiv
Top 0.1%
14.6%

Introduction: Large language models are increasingly being used in healthcare. In interventional pain medicine, clinical reasoning is essential for procedural planning. Prior studies show that simplified prompts reduce clinical detail in AI-generated responses. It remains unclear whether this reflects knowledge loss or simply prompt-driven suppression of information. Methods: We performed a controlled comparative study using 15 standardized low back pain questions representing common interventional pain questions. Each question was submitted to ChatGPT under three conditions: a professional-level prompt (DP), a fourth-grade reading-level prompt (D4), and clinician-directed rewriting of the D4 response to a medical level (U4→MD). No follow-up prompting was allowed. Three physicians independently rated responses for accuracy on a 0-2 ordinal scale. Clinical completeness was determined by consensus. Word count and Flesch-Kincaid Grade Level (FKGL) were also measured. Paired t-tests compared conditions. Results: Accuracy was highest with professional prompting (1.76). Accuracy declined with the fourth-grade prompt (1.33; p = 0.00086). When simplified responses were rewritten for clinicians, accuracy returned to baseline (1.76; p ≈ 1.00 vs. DP). Clinical completeness followed the same pattern: DP 80.0%, D4 6.7%, U4→MD 73.3%. Fourth-grade responses were shorter and less complex. Upscaled responses were more complex and similar in length to professional responses. Inter-rater reliability was low (Fleiss κ = 0.17), but trends were consistent across conditions. Conclusions: Reduced clinical detail under simplified prompts appears to reflect constrained output rather than loss of knowledge. Clinician-directed reframing restores omitted content. LLM performance in interventional pain depends strongly on prompt design and intended audience.
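The paired t-tests comparing prompt conditions reduce to a t statistic on per-question rating differences. A minimal sketch of that arithmetic, with illustrative ratings rather than the study's data:

```python
import math

def paired_t(a, b):
    """Paired t statistic for two rating conditions of equal length:
    mean of per-item differences over its standard error."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

The p-value then comes from the t distribution with n - 1 degrees of freedom; a statistics package would supply that step.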

10
Personalized Machine Learning using Passive Sensing and Ecological Momentary Assessments for Meth Users in Hawaii: A Research Protocol

Washington, P.

2023-08-25 health informatics 10.1101/2023.08.24.23294587 medRxiv
Top 0.1%
14.4%

Background: Artificial intelligence (AI)-powered digital therapies that detect meth cravings, delivered on consumer devices, have the potential to reduce care disparities by providing remote and accessible care solutions to Native Hawaiian, Filipino, and Pacific Islander (NHFPI) communities with limited care options. However, NHFPI remain understudied with respect to digital therapeutics and AI health sensing despite using technology at the same rates as other races. Objective: We seek to fulfill two research aims: (1) understand the feasibility of continuous remote digital monitoring and ecological momentary assessments (EMAs) in NHFPI in Hawaii by curating a novel dataset of longitudinal Fitbit biosignals with corresponding craving and substance use labels, and (2) develop personalized AI models which predict meth craving events in real time using wearable sensor data. Methods: We will develop personalized AI/ML (artificial intelligence/machine learning) models for meth use and craving prediction in 40 NHFPI individuals by curating a novel dataset of real-time Fitbit biosensor readings and corresponding participant annotations (i.e., raw self-reported substance use data) of their meth use and cravings. In the process of collecting this dataset, we will glean insights about cultural and other human factors which can challenge the proper acquisition of precise annotations. With the resulting dataset, we will employ self-supervised learning (SSL) AI approaches, a family of ML methods that allow a neural network to be trained without labels by being optimized to make predictions about the data itself. The inputs to the proposed AI models are Fitbit biosensor readings and the outputs are predictions of meth use or craving. This paradigm is gaining increased attention in AI for healthcare. Conclusions: We expect to develop models which significantly outperform traditional supervised methods by fine-tuning to an individual subject's data.
Such methods will enable AI solutions which work with the limited data available from NHFPI populations and which are inherently unbiased due to their personalized nature. Such models can support future AI-powered digital therapeutics for substance abuse.

11
A Review of Point-of-Care Devices for Blood-Testing Towards AI-driven Remote Digital Care, Precision Healthcare and Predictive Medicine

Gu, J.; Zenil, H.

2025-12-15 health informatics 10.64898/2025.12.13.25340658 medRxiv
Top 0.1%
14.3%

Point-of-care (POC) blood testing enables rapid, decentralized diagnostics with transformative promise, yet its innovation landscape remains poorly mapped. To this end, we focused on features that we believe are key to progress in precision healthcare and predictive medicine, such as longitudinal data collection and data-analytics integration. While no review can be complete, this work attempts to address this gap by analyzing 86 POC blood testing devices worldwide and proposing a unified framework to compare them across technology principles, diagnostic breadth, usability, regulatory pathway, deployment feasibility (via a custom index), and data/AI integration. Electrochemical biosensors were the single largest platform (29.1%), strongly associated with glucose testing (χ² = 237.8, p < 0.001), while spectroscopic and microfluidic systems remained niche due to higher costs and specialized requirements. Regulatory approval skewed toward moderate risk (44.2% FDA Class II; 27.4% IVDR Class C), while approval times lengthened with risk class (e.g., IVDR Class D ≈540 days). A trade-off was observed between usability and panel breadth: tools for home or low-resource settings emphasize simplicity and affordability, whereas clinical systems expand diagnostic range at higher complexity and cost. Deployment feasibility scores favored handhelds, while benchtops were penalized by workflow and capital demands, and microfluidics by consumables. Innovation clusters in North America, Europe, and East Asia reinforce global leadership and disparities.
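The reported platform-glucose association (χ² = 237.8) is a Pearson chi-square on a contingency table; for the 2x2 case the statistic has a closed form. A sketch with illustrative counts, since the review's actual (larger) table is not given in the abstract:

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the
    2x2 contingency table [[a, b], [c, d]], e.g. rows = electrochemical
    vs. other platforms, columns = glucose vs. non-glucose tests."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
```

The statistic is then compared against the chi-square distribution with 1 degree of freedom (for a 2x2 table) to obtain the p-value.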

12
AI-Generated Clinical Summaries: Errors and Susceptibility to Speech and Speaker Variability

Draper, T. C.; Leake, J.; Cox, T.; Lamb-Riddell, K.; Johns, B. E.; McCormick, J.; Trowell, S.; Kiely, J.; Luxton, R.

2025-10-30 health informatics 10.1101/2025.10.29.25339041 medRxiv
Top 0.1%
14.3%

Summary Box
What is already known on this topic: Clinical AI Scribe (CAIS) outputs can contain errors, and the impact of human factors (e.g., communication style, accents, speech impairments) in clinical contexts remains under-characterised.
What this study adds: In controlled simulations, patient personality and accent did not significantly alter total CAIS errors, with omissions predominating and hallucinations/inaccuracies remaining low. Speech-impairment effects were highly varied, with near-perfect recognition for cleft palate and vowel disorders, whereas phonological impairment substantially reduced accuracy.
How this study might affect research, practice or policy: The findings support clinician-in-the-loop deployment with local validation across representative accents and impairment profiles, prioritising detection of clinically critical errors. Routine governance should include subgroup performance reporting (accents, impairments) and ongoing audit of error rates.
Objective: The study aims to evaluate whether variability in patients' communication style (personality, international English accents, and speech impairments) affects the accuracy of a Clinical AI Scribe (CAIS), and to identify where performance degrades. Method and Analysis: We conducted simulated primary-care consultations in a purpose-built lab using trained actors. To investigate personality types, four scenarios were enacted, each with five patient-personality types. For accents, human-verified transcripts of consultations were used to generate all doctor/patient combinations of seven different accents (including a synthetic reference voice) across five scenarios. The CAIS produced SOAP-structured summaries that were compared with the transcripts. Errors were classified as omissions, factual inaccuracies, or hallucinations. For speech impairments, public recordings representing five profiles were transcribed and word-recognition accuracy was calculated.
Results: Personality types showed no statistically significant differences in errors (all p > 0.05). Extraversion had the highest total errors (median 3.5), while conscientiousness and agreeableness were lower (1.5 and 2.0, respectively). Across accents, both pairwise tests and group comparisons were non-significant for both patient and doctor voices (patients: p = 0.851; doctors: p = 0.98). Omissions predominated, with low rates of hallucinations and factual inaccuracies. Omissions were slightly higher for Chinese- and Indian-accented doctors (both medians 3.0). In contrast, speech impairments differed: cleft palate and vowel disorders were near-perfect, whereas phonological impairment markedly reduced recognition (p < 0.001). Conclusions: Under controlled conditions, CAIS performance was broadly stable across communication styles and most accents but remained vulnerable to specific speech characteristics, particularly phonological impairment. Future evaluations using real-world, multi-speaker clinical audio are needed to confirm performance.
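The word-recognition accuracy used for the speech-impairment profiles is conventionally 1 minus the word error rate, computed from a word-level edit distance; a minimal sketch (the abstract does not specify the exact scoring tool the authors used):

```python
def word_accuracy(reference, hypothesis):
    """Word-recognition accuracy = 1 - WER, where WER is the word-level
    Levenshtein distance (substitutions + insertions + deletions)
    divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # classic dynamic-programming edit distance over words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return 1 - prev[-1] / len(ref)
```

Note that accuracy can go negative when the hypothesis contains many insertions, which is why some evaluations clamp it at zero.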

13
Adherence Risk Stratification in Physiatry: A Multivariate Analysis of Factors in Community-Based Care Using Algorithmic Modeling Techniques

Trinh, H.; Kounang, R.

2025-04-28 rehabilitation medicine and physical therapy 10.1101/2025.04.26.25326474 medRxiv
Top 0.1%
14.2%

Missed appointments represent a double-edged sword in community health settings. Policies designed to retain patients and ensure continuity of care for vulnerable populations often mean that discharging patients is rare, even in cases of frequent no-shows. However, this retention strains healthcare resources, disrupts workflows, and exacerbates inequities in access to care. In physiatry (PM&R), where rehabilitation outcomes depend on consistent patient engagement, missed visits further hinder progress, delaying recovery and diminishing quality of life. Addressing appointment adherence in these settings is paramount for equitable and efficient care delivery. This study evaluated appointment adherence using EPIC-derived general risk scores (demographics, clinical history, and other individual-level factors) and preventative gap scores (compliance with recommended preventative care guidelines). To add granularity, demographic variables such as age, sex, and race/ethnicity, the Social Drivers of Health (SDOH) factors embedded within these risk scores, were further analyzed as an additional layer to identify structural and systemic barriers influencing patient engagement. A Residual Deep Neural Network (RDNN) was developed, achieving an AUC-ROC of 0.997, recall of 0.988, F1-score of 0.987, and accuracy of 0.980. A Deep Neural Network with Attention (DNNA) was introduced for interpretability, offering opportunities to refine and extend the RDNN's predictive performance. It demonstrated a 5.7x improvement over a clamped baseline for no-show risk prediction. These findings emphasize the strengths of combining the RDNN's robust predictive capabilities with the DNNA's ability to model nuanced relationships. Together, they provide a pathway to optimize appointment adherence and enhance equitable care delivery in community health and PM&R settings.
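The reported recall, F1-score, and accuracy all derive from confusion-matrix counts; a minimal sketch of the arithmetic (the counts below are illustrative, not the study's):

```python
def classification_metrics(tp, fp, fn, tn):
    """Recall, F1-score, and accuracy from confusion-matrix counts
    (true/false positives and negatives)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return recall, f1, accuracy
```

AUC-ROC, by contrast, is threshold-free: it integrates the true-positive rate over the false-positive rate as the decision threshold sweeps, so it cannot be recovered from a single confusion matrix.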

14
Toward Trustworthy Chatbots: A Protocol for Red Teaming for Health Related Conversations

Hussain, S.-A.; Jackson, D. I.; Lewis, A.; Fosler-Lussier, E.; Sezgin, E.

2025-12-16 health informatics 10.64898/2025.12.15.25342297 medRxiv
Top 0.1%
14.0%

Introduction: Health-related chatbots are increasingly used to mediate conversations that carry clinical significance and emotional weight. Retrieval-augmented generation (RAG) can reduce factual errors ("hallucinations"), but risks remain, with additional challenges coming from chatbots acting against behavioral safety and scope rules. Red teaming, an adversarial testing process that deliberately probes systems for failures before deployment, offers a way to surface potential risks. We describe a task-informed red-teaming protocol for health-related, patient-facing chatbots. Methods: Our protocol comprises error stratification, single- and multi-turn attack evaluation, and a framework for mitigation techniques. We define an error framework that distinguishes Knowledge Adherence (KA: staying faithful to retrieved documents) from Behavioral Adherence (BA: following safety, tone, and scope instructions). Our single-turn attacks consist of seven attack vectors reflecting real-world pressures, including advice-seeking, user distress, and prompt injection. A subset of these vectors is evaluated in multi-turn attacks. We evaluate two mitigation strategies: (1) prompt augmentation, which adds explicit guardrails to the chatbot prompt, and (2) document augmentation, which adds a localized FAQ document to the retrieval corpus. Finally, we apply this protocol to a social care chatbot (specifically supporting Health-Related Social Needs (HRSN)), developed as an agentic workflow that queries a vetted HRSN resource index. The evaluation corpus comprises 140 single-turn probes and 20 multi-turn stress tests. We assess correctness and risk severity via human annotation. Results: Our error framework identified that the primary safety risk was a failure to follow behavioral rules, rather than a lack of factual knowledge.
Furthermore, multi-turn stress tests revealed critical vulnerabilities that single-turn testing missed, directly informing our choice of targeted mitigations. In single-turn tests, the chatbot was factually robust, yielding 0/60 KA errors; however, it struggled with behavioral instructions, producing a 15% (12/80) BA error rate, with 21% (4/19) of those being high-severity. Notable vulnerabilities included advice_query (BA 30%, 6/20) and prompt_injection (BA 20%, 4/20). User_distress triggered the hallucination of unverified contact details in 20% (4/20) of cases. In multi-turn stress tests, error rates rose sharply under conversational persistence: advice_query BA errors reached 50% (5/10) and user_distress reached 40% (4/10), accounting for all high-severity errors (4/4). Prompt augmentation reduced total errors across these vectors by 60% (15/60 → 6/60). Document augmentation eliminated all single-turn user_distress errors (to 0/20) and reduced advice_query errors (7/20 → 4/20). When combined in multi-turn tests, these mitigations eliminated high-severity errors entirely, reducing BA errors to 20% (advice_query) and 30% (user_distress) by forcing the chatbot into "safe failure" loops. Conclusion: We demonstrate that a protocol combining single-turn breadth, multi-turn depth, and layered mitigations materially improves chatbot safety and offers a practical template for patient-facing chatbots. Future work should expand this protocol to chatbots in more diverse clinical domains and with a larger panel of evaluators.
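The per-vector error rates above come from a stratified tally over annotated exchanges; a minimal sketch of that bookkeeping (the annotation record shape here is an assumption for illustration, not the authors' schema):

```python
from collections import Counter

def error_rates(annotations):
    """annotations: iterable of (attack_vector, error_type) pairs, with
    error_type None for a clean exchange (e.g. ("advice_query", "BA")).
    Returns {(vector, error_type): rate} with each rate computed over
    all exchanges probed with that vector."""
    totals, errs = Counter(), Counter()
    for vector, err in annotations:
        totals[vector] += 1
        if err is not None:
            errs[(vector, err)] += 1
    return {key: errs[key] / totals[key[0]] for key in errs}
```

The same tally run separately on single-turn and multi-turn transcripts exposes the rise in error rates under conversational persistence that the abstract reports.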

15
Unsupervised subgrouping of chronic low back pain patients treated in a specialty clinic

Torres Espin, A.; Keller, A.; Ewing, S.; Bishara, A.; Takegami, N.; Ferguson, A. R.; Scheffler, A.; Hue, T.; Lotzs, J.; Peterson, T.; Zheng, P.; O'Neill, C.

2023-11-05 pain medicine 10.1101/2023.11.04.23298104 medRxiv
Top 0.1%
12.7%
Show abstract

Background: Chronic low back pain (cLBP) is the leading cause of disability worldwide. Current treatments have minor or moderate effects, partly because of the idiopathic nature of most cLBP cases, the complexity of its presentation, and heterogeneity in the population. Explaining this complexity and heterogeneity by identifying subgroups of patients is critical for personalized health. Clinical decisions tailoring treatment to a patient subgroup's characteristics and specific treatment responses can improve health outcomes. Current patient stratification tools divide cases into subgroups based on a small subset of characteristics, which may not capture the many factors determining patient phenotypes. Methods and Findings: In this study, we use an unsupervised machine learning framework to identify patient subgroups within a specialized back pain clinic and evaluate their outcomes. Our analysis identified 25 latent factors determining patient phenotypes and found three distinctive clusters of patients. The research suggests that there is heterogeneity in the population of patients treated in a specialty setting and that several factors determine patient phenotypes. Cluster 1 consists of individuals with characteristics found to be protective against chronic pain: younger age, low pain medication prescription, high function, good insurance access, and few overlapping pain conditions. Individuals in Cluster 3 are older and present a higher incidence of chronic overlapping pain conditions, comorbidities, and pain medication use. Cluster 2 is an intermediate group. Conclusions: We quantify cLBP population heterogeneity and demonstrate how a machine learning analytical workflow can be used to explain, in part, this heterogeneity in relation to outcomes.
Notably, a data-driven approach using multi-domain data produces different subgroups than the STarT Back screening tool, and adding other baseline functional metrics, such as global physical and mental function and pain intensity, increases the variance explained in outcomes. Our study provides novel insights into the complex nature of cLBP and the potential for data-driven methods to identify clinically relevant subtypes.
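The workflow this abstract describes, latent factor extraction followed by clustering in factor space, can be sketched with scikit-learn. The 25 factors and three clusters match the abstract, but the synthetic data and pipeline choices below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import FactorAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in for multi-domain patient data:
# rows = patients, columns = clinical/psychosocial measures.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 60))

# Standardize measures, reduce to 25 latent factors, then
# cluster patients by their factor scores into 3 subgroups.
factor_model = make_pipeline(
    StandardScaler(),
    FactorAnalysis(n_components=25, random_state=0),
)
factors = factor_model.fit_transform(X)      # shape: (300, 25)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(factors)
```

Cluster profiles (age, medication use, function, and so on) would then be summarized per label to characterize the subgroups, as the study does.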

16
Stanford Screenomics: An Open-source Platform for Unobtrusive Multimodal Digital Trace Data Collection from Android Smartphones

Kim, I.; Boffa, J.; Cho, M.; Conroy, D. E.; Kline, N.; Haber, N.; Robinson, T. N.; Reeves, B.; Ram, N.

2025-06-26 health informatics 10.1101/2025.06.24.25329707 medRxiv
Top 0.1%
12.6%
Show abstract

Smartphone-based digital trace data can offer powerful insights for identifying behavioral patterns and health risks. However, existing tools for comprehensive data collection lack scalability, customizability, transparency, and accessibility. To address these gaps, we developed an open-source platform that enables in-situ capture of multimodal digital traces from smartphones (e.g., moment-by-moment capture of screenshots, application usage logs, interaction histories, and phone sensor readings). The Stanford Screenomics Data Collection application allows researchers to tailor data types and quality, data transfer methods, and upload cadence. The Dashboard application supports real-time monitoring of participants' data provision, identification of data issues, and automated reactive communications to participants. The platform's back end employs a NoSQL database for secure, HIPAA-compliant storage. Using illustrative 24-hour digital trace data, we demonstrate how the platform expands the range of possible digital phenotyping studies.

17
A comparison of in-person and telehealth treatment modalities using the SpeechVive device

Covert, R.; Snyder, S.; Lambert, A.; Spremulli, M.; Brown, B.; Dwenger, K.; Malandraki, G.; McDonough, M.; Brosseau-Lapre, F.; Huber, J. E.

2025-01-17 rehabilitation medicine and physical therapy 10.1101/2025.01.16.25320611 medRxiv
Top 0.1%
12.6%
Show abstract

Telehealth is increasingly popular as a treatment option for people with Parkinson disease (PD). The SpeechVive device is a wearable device that uses the Lombard effect to help patients speak more loudly, slowly, and clearly. This study examined the effectiveness of the device at improving communication in people with PD when delivered via telehealth as compared to in-person, using an implementation science design. Sixty-six people with PD were enrolled for 12 weeks, with 34 choosing the in-person group and 32 the telehealth group. Participants were assessed pre-, mid-, and post-treatment. Participants produced continuous speech samples on and off the device at each timepoint. Sound pressure level (SPL), utterance length, pause frequency, and total pause duration were measured. Psychosocial surveys were administered to evaluate the effects of treatment on depression, self-efficacy, and participation. The in-person group increased SPL when wearing the device, while the telehealth group did not. Both groups paused less often while wearing the device. Utterance length increased post-treatment for the telehealth group, but not for the in-person group. An increase in communication participation ratings in the telehealth group, but not the in-person group, was the only significant change in the psychosocial metrics. The in-person group showed treatment effects similar to previous studies. The device was not as effective in the telehealth group. One limitation was data loss due to recording issues, which affected the telehealth group more than the in-person group.

18
Clinician Experiences with Ambient AI Scribe Technology in Singapore: A Qualitative Study

Shankar, R.; Goh, A.; Xu, Q.

2026-03-19 health informatics 10.64898/2026.03.17.26348627 medRxiv
Top 0.1%
12.6%
Show abstract

Background: The administrative burden of clinical documentation is a recognised contributor to clinician burnout and diminished care quality. Ambient artificial intelligence (AI) scribe technology, which uses large language models to passively record and summarise clinical encounters, has rapidly gained traction internationally. However, no published studies have examined clinician experiences with this technology in the Asia-Pacific region or within Singapore's multilingual healthcare system. Objective: This study explored clinician perspectives on ambient AI scribe technology at Alexandra Hospital, Singapore, focusing on perceived benefits, barriers, workflow integration, ethical considerations, and recommendations for sustained implementation. Methods: A qualitative descriptive study was conducted using semi-structured interviews with 28 clinicians across multiple specialties at Alexandra Hospital, National University Health System (NUHS). Participants were purposively sampled for diversity in role, specialty, and usage level. Interviews were analysed using reflexive thematic analysis guided by the RE-AIM/PRISM framework. The COREQ checklist was followed. Results: Five themes emerged: (1) reclaiming presence in the clinical encounter, (2) navigating accuracy and trust in AI-generated documentation, (3) workflow disruption and adaptation, (4) privacy, consent, and ethical tensions within Singapore's regulatory landscape, and (5) envisioning sustainable integration. Clinicians reported improved patient engagement and reduced cognitive burden. Persistent barriers included accuracy concerns, AI hallucinations, limited multilingual functionality, loss of documentation style, and uncertainties around compliance with the Personal Data Protection Act (PDPA). Conclusions: Ambient AI scribe technology holds promise for alleviating documentation burden in Singapore's public healthcare system.
Realising this potential requires attention to safety validation, multilingual capability, clinician training, and patient-centred consent aligned with local regulatory frameworks.

19
COVID-19: Affect recognition through voice analysis during the winter lockdown in Scotland

de la Fuente Garcia, S.; Haider, F.; Luz, S.

2021-05-09 health informatics 10.1101/2021.05.05.21256668 medRxiv
Top 0.1%
12.5%
Show abstract

The COVID-19 pandemic has led to unprecedented restrictions on people's lifestyles, which have affected their psychological wellbeing. In this context, this paper investigates the use of social signal processing techniques for remote assessment of emotions. It presents a machine learning method for affect recognition applied to recordings taken during the COVID-19 winter lockdown in Scotland (UK). This method is based exclusively on acoustic features extracted from voice recordings collected through home and mobile devices (i.e., phones, tablets), thus providing insight into the feasibility of monitoring people's psychological wellbeing remotely, automatically, and at scale. The proposed model predicts affect with a concordance correlation coefficient of 0.4230 (using Random Forest) for arousal and 0.3354 (using Decision Trees) for valence. Clinical relevance: In 2018/2019, 12% and 14% of Scottish adults reported depression and anxiety symptoms, respectively. Remote emotion recognition through home devices would support the detection of these difficulties, which are often underdiagnosed and, if untreated, may lead to temporary or chronic disability.
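The concordance correlation coefficient (CCC) reported here measures agreement, not just correlation, between predicted and observed affect ratings, penalizing both scale and location shifts. A minimal implementation of the standard (Lin's) formula, not code from the paper:

```python
import numpy as np

def concordance_ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient:
    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2).
    Returns 1.0 for perfect agreement; drops below the Pearson r
    whenever predictions are biased or mis-scaled."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mt, mp = y_true.mean(), y_pred.mean()
    vt, vp = y_true.var(), y_pred.var()          # population variances
    cov = ((y_true - mt) * (y_pred - mp)).mean() # population covariance
    return 2 * cov / (vt + vp + (mt - mp) ** 2)
```

For example, predictions shifted by a constant offset remain perfectly correlated (Pearson r = 1) yet score a CCC below 1, which is why CCC is the usual choice for continuous arousal/valence evaluation.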

20
Evaluating User Experiences with an AI Chatbot for Health-Related Social Needs: A Cross-Sectional Mixed Methods Study

Sezgin, E.; Jackson, D. I.; Hussain, S.-A.; Kocaballi, A. B.; Skeens, M.; Fosler-Lussier, E.; Ewing, A.; Donneyong, M.; Pai, A.

2025-06-17 health informatics 10.1101/2025.06.16.25329054 medRxiv
Top 0.1%
12.4%
Show abstract

Background: Unmet health-related social needs (HRSNs), such as food insecurity and housing, significantly impact health outcomes and wellbeing. Although screening tools are widely adopted to identify these needs, sustainable linkage to resources remains challenging. Conversational agents (chatbots) offer potential solutions for tailored, personalized feedback and real-time navigation, yet their usability and trustworthiness among populations with high needs require further exploration. Methods: We conducted a mixed-methods study to evaluate user experiences with the DAPHNE© chatbot, which is designed to identify unmet HRSNs and provide personalized resource recommendations. Quantitative and qualitative data were collected from 128 caregivers with at least one dependent child. The online study design combined scenario/task-based and free-form chatbot use to guide engagement. Study measures included usability (SUS), task load (NASA-TLX), satisfaction (NPS), and trust. Qualitative analysis involved user feedback and user-chatbot conversation transcripts. We used regression and pairwise analyses to explore associations between demographic characteristics, self-reported unmet HRSNs, and user experience outcomes (usability, satisfaction, task load, and trust). Results: Most participants were female (68%), aged 30-49 years (71%), White (44%), Black/African American (36%), or Hispanic/Latino (27%); relied on Medicaid/Medicare (83%); and cared for a child with, or themselves had, special healthcare needs (78%). Participants reported high usability (SUS = 84.7, SD = 12.4), low task load (NASA-TLX = 6.8, SD = 2.8), high satisfaction (NPS = 8.0, SD = 2.4), and high trust (mean = 4.1, SD = 0.8). Nearly all participants (98%) reported unmet HRSNs, including food insecurity (76%) and financial limitations (75%). Free-form chatbot conversation sessions averaged 3 minutes and ~20 turns, with greater use of assistive buttons over typing.
Furthermore, DAPHNE (using retrieval-augmented generation to ground every recommendation in a live social-care API) achieved 99% intent accuracy across 1,523 message turns. Dialogues focused on financial, housing, and food needs. 94% of participants found the tool helpful, while requesting design features such as saved histories, voice interaction, and richer local resource details. Regression analysis showed usability and trust were broadly consistent across most demographic groups, though participants with higher education and lower income showed modest decrements in usability. Several HRSNs, including transportation and utility disruptions, were associated with higher trust and satisfaction, suggesting the assistant may hold particular value for users facing these structural barriers. Discussion: The DAPHNE chatbot demonstrates potential as a useful tool for addressing HRSNs, with strong usability and trust among diverse populations. Future designs should focus on longitudinal impact and effectiveness assessments, enhance accessibility, and address practical implementation challenges.
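Of the study measures above, the System Usability Scale (SUS) score (here 84.7) is derived from ten 1-5 Likert items via the standard SUS scoring rule. A minimal sketch of that standard formula, not the authors' code:

```python
def sus_score(responses):
    """Standard SUS scoring for ten 1-5 Likert responses.

    Odd-numbered items (1, 3, 5, ...) are positively worded and
    contribute (r - 1); even-numbered items are negatively worded
    and contribute (5 - r). The 0-40 sum is scaled by 2.5 to 0-100.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")
    total = sum(
        (r - 1) if i % 2 == 0 else (5 - r)  # index 0 is item 1 (odd-numbered)
        for i, r in enumerate(responses)
    )
    return total * 2.5
```

A mean SUS of 84.7 sits well above the commonly cited 68-point benchmark, which is what supports the abstract's "high usability" claim.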